What You’ll Do:
CoreWeave is building and operating some of the largest GPU infrastructure in the world. The Metal Net team owns the high‑bandwidth GPU interconnect platforms that make large‑scale AI and HPC workloads possible, including NVLink and NVSwitch‑based systems. We deploy, operate, troubleshoot, and improve these platforms across our global data centre footprint to provide a powerful alternative to traditional hyperscalers.
About the role:
We are looking for an HPC Engineer to deploy, operate, and support NVLink/NVSwitch platforms across large data centre environments. This role is a strong fit for engineers who enjoy production troubleshooting, hardware‑adjacent systems work, automation, observability, and learning specialized infrastructure in depth. You will troubleshoot Linux, networking, hardware, firmware, performance, and stability issues in production, and build automation that improves runbooks, dashboards, alerts, and lifecycle workflows. You will also participate in a rotating on‑call schedule, lead incident response, conduct root cause analyses, and collaborate with teams across CoreWeave so that these workflows stay reliable as our global fleet grows.
Who You Are:
- Strong Linux system administration and engineering troubleshooting skills.
- Solid grasp of networking fundamentals and common diagnostic/troubleshooting tools.
- Hands‑on production debugging experience using logs, metrics, and command‑line interfaces.
- Technical experience troubleshooting server, network, GPU, or data centre hardware.
- Practical scripting or automation experience using Python, Go, Bash, or similar languages.
- Clear written and verbal communication, documentation skills, and readiness to participate in an on‑call rotation.
- Strong curiosity and a drive to learn specialized GPU interconnect technologies such as NVLink, NVSwitch, and InfiniBand in depth.
Preferred:
- Experience with Ansible or other infrastructure‑as‑code and configuration automation tooling.
- Kubernetes application development or live platform operations experience.
- Familiarity with modern observability systems, including Grafana, Prometheus, PromQL, or similar stack components.
- Experience managing large fleet operations across Linux systems, network devices, GPUs, or infrastructure components.
- Deep understanding of InfiniBand, RDMA, HPC networking, or low‑latency/high‑bandwidth fabrics.
- Experience with BMC, Redfish, IPMI, firmware lifecycle management, or hardware management APIs.
- Exposure to NVLink, NVSwitch, NVIDIA GPU platforms, NVUE, SONiC, or specialized network operating systems.
Benefits:
- Family‑level Medical Insurance
- Family‑level Dental Insurance
- Generous Pension Contribution
- Life Assurance at 4x Salary
- Critical Illness Cover
- Employee Assistance Programme
- Tuition Reimbursement
- Work culture focused on innovative disruption
The base salary range for this role is £79,000 to £105,000. The starting salary will be determined based on job‑related knowledge, skills, experience, and market location. We strive for both market alignment and internal equity when determining compensation. In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits programme (all based on eligibility).
Equal Opportunity:
CoreWeave is an equal opportunity employer, committed to fostering an inclusive and supportive workplace. All qualified applicants and candidates will receive consideration for employment without regard to race, color, religion, sex, disability, age, sexual orientation, gender identity, national origin, veteran status, or genetic information.
