Requirements
- Bachelor’s Degree or equivalent experience in Computer Science, Engineering, or a related field
- 5+ years of hands‑on technical experience in SRE, Platform Engineering, Infrastructure, or related roles
- Strong experience with AWS (or Azure), including services such as EKS, ECS, EC2, networking, IAM, and managed services
- Solid understanding of cloud security principles and experience collaborating with security teams
- Strong background in Linux systems administration
- Proven experience designing and operating observability platforms, including monitoring, logging, and alerting
- Hands‑on experience with Datadog for metrics, logs, APM, and alerting
- Strong understanding of SRE principles, including SLOs, error budgets, incident management, and reliability engineering
- Experience working closely with architecture and engineering teams on system design and delivery
- Experience with cloud cost optimization strategies and tooling
- (Desirable) Experience supporting multi‑cloud or hybrid environments
- (Desirable) Exposure to Infrastructure as Code (e.g., Terraform, CloudFormation)
- (Desirable) Experience in large‑scale, complex, or regulated environments
- (Desirable) Knowledge of vector databases and RAG architectures for building internal SRE knowledge assistants
- (Desirable) Knowledge of Generative AI and LLM platforms (e.g., Claude, Amazon Bedrock)
- Strong technical authority with the ability to influence design and operational decisions
- Highly collaborative, comfortable working across architecture, engineering, security, and operations teams
- Calm and methodical under pressure, especially during incidents and critical issues
- Pragmatic problem‑solver who balances reliability, security, cost, and delivery speed
- Clear communicator, able to explain complex technical concepts to diverse audiences
What the job involves
- We are evolving our Site Reliability Engineering capabilities to strengthen reliability, observability, security, and operational excellence across our Risk Intelligence division
- As a Senior SRE, you will be a senior, hands‑on technical contributor who helps shape the foundations of reliability across both new and existing platforms
- You will collaborate with Architecture, Engineering, Security, and Platform teams to ensure reliability is built into systems from day one
- While this is not a people‑management role, you will work closely with global teams and may occasionally be called upon for major incidents or critical issues
- This position requires a highly proactive, hard‑working expert with strong leadership presence and ownership of platform reliability outcomes
- We are looking for a person who is passionate about reliability engineering and who brings a continuous improvement approach to everything they do
- Lead the establishment of SRE foundations for new projects, including building environments, setting up monitoring and alerting, and ensuring operational readiness from day one
- Define, implement, and champion observability standards, tooling, and guidelines across metrics, logs, traces, and SLIs/SLOs
- Design and evolve monitoring and alerting solutions that improve visibility, reduce toil, and strengthen system health
- Continuously drive reliability improvements across our environments through incident reduction, performance tuning, and building resilient patterns
- Partner with Security teams to ensure our platforms meet compliance, security, and risk‑management expectations
- Influence architectural and design decisions through data‑driven cloud cost optimization and efficiency initiatives
- Be a technical leader and mentor, supporting engineers, shaping engineering standards, and fostering a culture of learning and development
