SRE (Linux, Firmware & Server Infrastructure)


Company: Networking People Limited

Location: Anderston

Job Description:

Contract: Senior Platform Reliability Engineer (Linux, Firmware & Server Infrastructure)

Is this the next step in your career? Find out if you are the right candidate by reading through the complete overview below.

Location: Glasgow (Hybrid – 3 days onsite)
Duration: 6 months
Day Rate: Negotiable (Inside IR35 via umbrella)
Reference: 20460

Overview
We are seeking a Senior Platform Reliability Engineer with deep Linux systems expertise and strong exposure to server hardware, firmware, and low-level infrastructure operations. This role sits within a high-performing enterprise infrastructure team responsible for maintaining and improving the reliability of critical platforms at scale.
The position is heavily focused on resolving complex platform and hardware-related incidents, particularly those escalated from L3 support, with an emphasis on firmware lifecycle management, disk encryption, logging, and server configuration (BIOS-level controls) across multi-vendor environments.
This is a hands-off hardware role, requiring strong remote troubleshooting capabilities, excellent communication skills, and the ability to work closely with internal teams and external vendors to drive issues through to resolution.

Key Responsibilities
Own and manage end-to-end incident resolution for platform and hardware-related issues, including triage, mitigation, escalation, and post-incident review
Diagnose and troubleshoot Linux OS-level issues arising from hardware faults, firmware changes, or configuration inconsistencies
Manage and support firmware lifecycle processes, including upgrades, validation, and issue remediation
Work with disk encryption technologies and logging frameworks, ensuring system integrity and auditability
Maintain and troubleshoot server configuration settings, including BIOS-level parameters across multiple hardware vendors (strong Dell focus)
Utilize out-of-band management tools (e.g., iDRAC, iLO, RACADM, Redfish APIs) for remote diagnostics and recovery
Analyse vendor logs, support bundles, and telemetry data to identify root causes and remediation paths
Engage directly with hardware vendors and engineering teams, managing escalations and driving timely resolutions
Contribute to continuous improvement initiatives, reducing incident recurrence and operational toil
Produce and maintain high-quality documentation, including runbooks, troubleshooting guides, and knowledge base articles
Participate in post-incident reviews (RCA) and support improvements in reliability metrics (MTTR, MTTD, SLOs)

Essential Skills & Experience
Strong Linux administration and troubleshooting expertise, including:
  Process and service management
  System logs and diagnostics
  Networking fundamentals
  Package and configuration management
Solid understanding of server hardware and infrastructure, including:
  Disks, RAID/HBA controllers
  NICs and firmware interactions
  Hardware failure modes and OS-level symptoms
Proven experience with:
  Firmware management and upgrades
  Disk encryption and secure configurations
  BIOS/server configuration management
Hands-on experience with remote management and lights-out technologies, such as:
  iDRAC, iLO
  RACADM
  Redfish or similar APIs
Strong track record of incident ownership, including:
  Triage and mitigation
  Cross-team coordination
  Stakeholder communication
  Driving issues through to resolution
Experience working with:
  Vendor diagnostics, logs, and support bundles
  Vendor escalation processes and engineering engagement
Excellent communication skills (written and verbal), with the ability to clearly articulate technical issues to both technical and non-technical stakeholders
Strong documentation skills, including creation of runbooks, procedures, and RCA reports

Desirable Skills
Scripting and automation experience (e.g., Python, Bash, Ansible)
Familiarity with configuration management and automation frameworks
Exposure to virtualisation and containerisation technologies (VMware, KVM, Docker, Kubernetes)
Experience with monitoring, observability, and alerting systems, including log analysis and alert tuning
Understanding of SRE principles and metrics, including SLOs, SLIs, error budgets, MTTR/MTTD

Key Attributes
Methodical and detail-oriented approach to troubleshooting
Strong sense of ownership and accountability
Comfortable working in high-pressure, incident-driven environments
Collaborative mindset with the ability to work across global teams and vendors
Proactive approach to continuous improvement and operational excellence

Networking People (UK) is acting as an Employment Business in relation to this vacancy.

Posted: May 21st, 2026