Software Engineer - Site Reliability Engineering
London
Neo4j is the graph intelligence platform that transforms data into knowledge to power the next generation of intelligent applications and AI systems.
The Site Reliability Engineering team’s mission is to improve the reliability of Neo4j’s DBaaS product: Neo4j Aura. Operating at a global scale across all three major cloud providers, Aura runs hundreds of Kubernetes clusters and hosts thousands of Neo4j instances in production at any given time.
We’re reshaping what SRE means at Neo4j Aura—and we want you to be part of that journey.
Rather than firefighting or chasing alerts, we’re helping teams design for reliability from day one. That means building the tools, practices, and culture that embed SRE principles at the heart of how Aura operates. You’ll be joining a team focused on long‑term resilience, engineering excellence, and meaningful collaboration with product teams.
The Role
- Automate for insight and scale: Build systems that make troubleshooting fast, safe, and scalable across thousands of Neo4j instances. From internal tools that surface clear insights to canaries that support safe rollouts, you’ll focus on automation that elevates reliability engineering.
- Treat operations as a software problem: Replace tribal knowledge and ad‑hoc scripts with tools and systems that codify best practices—making operations predictable, scalable, and repeatable.
- Design for resilience, learn from failure: Own and evolve the tooling and processes behind incident response. From clear alerts to blameless reviews, you’ll help ensure teams respond with confidence and learn with clarity.
- Champion reliability as a product feature: Help teams define and act on SLIs and SLOs, turning reliability into a shared, data‑driven priority across engineering.
- Create signals, not noise: Shape an observability stack that tells us what matters, when it matters—so we can detect issues early and resolve them quickly.
We're interested in hearing from engineers with deep experience in some of the following areas:
- Writing backend tools and automation in Go—our primary language—with an emphasis on sound architecture, testing, and maintainability. Strong software skills in other languages, like Python, are also welcome.
- Applying SRE practices in real‑world environments: defining SLIs and SLOs, reducing toil through automation, and driving reliability through engineering.
- Collaborating with other teams to promote SRE thinking—educating on principles like observability, ownership, and service level objectives.
- Troubleshooting large‑scale, cloud‑based systems with confidence and curiosity.
- Monitoring distributed systems and understanding their performance characteristics.
- Designing systems with reliability, safety, and debuggability as first‑class concerns.
- Working with observability tools like the OpenTelemetry (OTel) Collector, Prometheus, Grafana, and Google Cloud’s operations suite.
- Deploying and managing applications on Kubernetes; cluster‑level administration is a plus.
- Managing infrastructure with Kustomize and Terraform—keeping it clear, modular, and easy to evolve.
- Building and maintaining CI/CD workflows—ours run on GitHub Actions.
- Participating in on‑call rotations and incident response with a focus on improvement, not blame.
- Writing and contributing to post‑mortems that lead to meaningful, lasting changes.
Research shows that members of underrepresented communities are less likely to apply for jobs when they don’t meet all the qualifications. If this is part of the reason you hesitate to apply, we’d encourage you to reconsider and give us the opportunity to review your application. At Neo4j, we are committed to building awareness of these issues and helping to address them.
One of our central objectives is to provide an inclusive, diverse, and equitable workplace for everyone to develop their potential and have a positive, career‑defining experience. We look forward to receiving your application.