Vice President – Site Reliability Engineer – London
BNY is seeking a Vice President – Site Reliability Engineer to design, build, deploy, and scale resilient, automated, and centrally managed engineering solutions for Production Services. This role suits a strong full‑stack engineer with expertise spanning application development, UI engineering, backend services, infrastructure automation, and production reliability.
Responsibilities
- Design, develop, and deploy centralized engineering solutions that improve operational efficiency, reduce toil, and enhance resiliency across Production Services.
- Build full-stack applications and internal engineering tools, including backend services, APIs, automation layers, and user-facing interfaces using technologies such as Python, Java, React, or Angular.
- Engineer scalable solutions that support central operational use cases such as self‑service tooling, operational dashboards, alert enrichment, incident reduction, service recovery, and workflow automation.
- Develop reusable frameworks and components that can be adopted broadly across Production Services teams to standardize and accelerate operational processes.
- Automate infrastructure, deployment, configuration, and runtime support activities using tools such as Ansible and Kubernetes.
- Define, implement, and continuously improve Service Level Indicators, Service Level Objectives, and service health measures aligned to operational and business priorities.
- Build and optimize monitoring, observability, and alerting capabilities using tools such as Prometheus, Grafana, AppDynamics, and Splunk.
- Apply AIOps capabilities to improve event correlation, anomaly detection, root cause analysis, predictive insights, and proactive issue prevention.
- Partner with engineering, infrastructure, production support, security, and risk teams to ensure developed solutions are secure, scalable, supportable, and aligned to enterprise standards.
- Identify manual, fragmented, or repetitive processes across Production Services and convert them into efficient, automated, centrally consumable solutions.
Required Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related technical discipline, or equivalent practical experience.
- Strong full‑stack development experience, with hands‑on expertise in Python and Java for backend or service‑layer engineering.
- Strong working knowledge of front‑end development using React or Angular, including building interfaces for operational or engineering use cases.
- Proven experience designing and deploying end‑to‑end solutions, from application development through production deployment and operational support.
- Experience in Site Reliability Engineering, Production Engineering, DevOps, Platform Engineering, or similar roles supporting business‑critical applications.
- Strong foundation in Linux/Unix systems administration, scripting, troubleshooting, and infrastructure concepts.
- Hands‑on experience with Ansible and Kubernetes in enterprise or production environments.
- Demonstrated ability to define and operationalize SLIs, SLOs, dashboards, alerts, and health indicators.
- Hands‑on experience with enterprise monitoring and observability platforms, including Prometheus, Grafana, AppDynamics, and Splunk.
- Strong troubleshooting, analytical, and problem‑solving skills in complex distributed or production environments.
- Strong verbal and written communication skills, with the ability to collaborate effectively with technical and non‑technical stakeholders.
Preferred Qualifications
- Experience building centralized internal platforms or shared engineering services for operational or enterprise users.
- Experience applying AIOps, machine learning, or intelligent automation within production support or reliability engineering environments.
- Exposure to CI/CD pipelines, infrastructure as code, API‑driven automation, and modern software delivery practices.
- Experience supporting distributed systems, cloud‑native platforms, or container‑based architectures.
- Knowledge of Agile, DevOps, and SRE operating models, including continuous improvement and blameless post‑incident practices.
- Ability to influence engineering standards and drive adoption of common tooling and automation patterns across teams.
