Site Reliability Engineer

Company: Intermedia Intelligent Communications
Job Description:

ALL CANDIDATES MUST BE LOCATED IN THE UK

About the Role

We are looking for an SRE to improve reliability and operational readiness with a strong focus on metrics, alerting, and event management. The role involves building and maintaining monitoring solutions using Prometheus or VictoriaMetrics, integrating alerts and events with BigPanda, and participating in on‑call rotations to drive fast incident response and continuous improvement across Windows and Linux environments.

Key Responsibilities

  • Build and operate metrics/monitoring platforms: Prometheus and/or VictoriaMetrics (scrape configs, exporters, recording rules)
  • Design and maintain alerting strategy: thresholds, anomaly detection, alert routing, deduplication, and noise reduction
  • Integrate monitoring/alerting and events with BigPanda (correlation, enrichment, routing, incident workflows)
  • Create and maintain dashboards and operational visibility (Grafana or equivalent)
  • Develop and maintain runbooks, operational playbooks, and incident response procedures
  • Participate in on‑call shifts: triage alerts, manage incidents, coordinate response, and lead communication during outages
  • Perform root‑cause analysis, post‑mortems, and implement corrective/preventive actions
  • Improve service reliability via SLOs/SLIs, capacity planning, and automation to reduce toil
  • Support monitoring for core infrastructure and services on Windows and Linux, including HA components and clusters
  • Collaborate with DevOps/Engineering to instrument applications and standardize telemetry (metrics, logs, traces where applicable)

Skills, Knowledge & Expertise

  • Experience in SRE / Operations / DevOps with production incident ownership
  • Hands‑on experience with Prometheus and/or VictoriaMetrics (exporters, alert rules, recording rules, troubleshooting)
  • Experience integrating alerting/event pipelines with BigPanda (or similar event correlation tools)
  • Strong troubleshooting skills across Linux and Windows systems (networking, OS, services)
  • Ability to build reliable alerting with minimal noise (correlation, grouping, suppression, maintenance windows)
  • Experience with Git‑based workflows for monitoring‑as‑code and configuration management

Nice to Have

  • Grafana administration and dashboard design
  • Log management (ELK/EFK, Loki) and/or tracing (OpenTelemetry)
  • Automation skills (Python, PowerShell, Bash) and configuration tools (Ansible)
  • Messaging/cache/proxy operations: RabbitMQ, Redis, NGINX
  • Experience with Windows clustering or HA environments
  • Experience defining SLOs/SLIs and operational KPIs
  • Experience managing VoIP components and protocols (SIP, FreeSWITCH, OpenSIPS, session border controllers)
  • Experience with load‑balancing components (F5 LTM, F5 GTM)
  • Experience with virtualization platforms such as VMware or Hyper-V
  • Experience administering AWS or Azure tenants

On‑call Expectations

  • Participation in a rotating on‑call schedule (including nights/weekends as needed)
  • Ownership of incident response: rapid triage, escalation, mitigation, and follow‑up improvements
  • Commitment to improving monitoring quality to reduce alert fatigue and improve MTTR

Diversity, Inclusion, and Equal Opportunity

We hire, promote, and compensate employees based on their ability to perform their job responsibilities, without regard to race, color, creed, religion, sex, gender, marital status, national origin, ancestry, age, citizenship, physical or mental disability, sexual orientation, or other bases protected by applicable law. We are an equal‑opportunity employer and value diversity at our company.

Posted: April 17th, 2026